COMP5625M Assessment 2 - Image Caption Generation [100 marks]

The maximum marks for each part are shown in the section headers. The overall assessment carries a total of 100 marks.

This assessment is weighted 25% of the final grade for the module.

Motivation

Through this assessment, you will:

  1. Understand the principles of text pre-processing and vocabulary building.
  2. Gain experience working with an image-to-text model.
  3. Use and compare two text similarity metrics for evaluating an image-to-text model, and understand evaluation challenges.

Setup and resources

Having a GPU will speed up the image feature extraction process. If you want to use a GPU, please refer to the module website for recommended working environments with GPUs.

Please implement the coursework using PyTorch and Python-based libraries, and refer to the notebooks and exercises provided.

This assessment will use a subset of the COCO "Common Objects in Context" dataset for image caption generation. COCO contains 330K images across 80 object categories, with at least five textual reference captions per image. Our subset consists of roughly 5,070 of these images, each with five or more different descriptions of the salient entities and activities, and we will refer to it as COCO_5070.

To download the data:

  1. Images and annotations: download the zipped file provided in the link here as COMP5625M_data_assessment_2.zip.

Info only: To understand more about the COCO dataset, you can look at the download page. We have already provided you with the "2017 Train/Val annotations (241MB)", but our image subset consists of fewer images than the original COCO dataset. So, no need to download anything from here!

  2. Image metadata: as our set is a subset of the full COCO dataset, we have created a CSV file containing relevant metadata for our particular subset of images. You can also download it from Drive, "coco_subset_meta.csv", at the same link as 1.

Submission

Please submit the following:

  1. Your completed Jupyter notebook file, in .ipynb format. Do not change the file name.
  2. The .html version of your notebook; File > Download as > HTML (.html). Check that all cells have been run and all outputs (including all graphs you would like to be marked) are displayed in the .html for marking.

Final note:

Please include everything you would like to be marked in this notebook, including figures. Under each section, put the relevant code containing your solution. You may re-use functions you defined previously, but any new code must be in the appropriate section. Feel free to add as many code cells as you need under each section.

Your student username (for example, sc15jb):

ml21s2j

Your full name:

Shengxuan Ji

Imports

Feel free to add to this section as needed.

Detect which device (CPU/GPU) to use.

The basic principle of our image-to-text model is pictured in the diagram below: an Encoder network encodes the input image as a feature vector, taken from the outputs of the last convolutional layer of a pre-trained CNN (we use ResNet50). This pretrained network has been trained on the full ImageNet dataset and is thus able to recognise common objects.

(Hint) You can alternatively use COCO-trained pretrained weights from PyTorch. One way to do this is to use "FasterRCNN_ResNet50_FPN_V2_Weights.COCO_V1" and take, e.g., resnet_model = model.backbone.body. Alternatively, you can use the checkpoint from your previous coursework where you fine-tuned on the COCO dataset.

These features are then fed into a Decoder network along with the reference captions. As the image feature dimensions are large and sparse, the Decoder network includes a linear layer which downsizes them, followed by a batch normalisation layer to speed up training. Those resulting features, as well as the reference text captions, are then passed into a recurrent network (we will use RNN in this assessment).

The reference captions used to compute loss are represented as numerical vectors via an embedding layer whose weights are learned during training.

The Encoder-Decoder network could be coupled and trained end-to-end, without saving features to disk; however, this requires iterating through the entire image training set during training. We can make the training more efficient by decoupling the networks. Thus, we will:

First extract the feature representations of the images from the Encoder

Save these features (Part 1) such that during the training of the Decoder (Part 3), we only need to iterate over the image feature data and the reference captions.

Hint: Try commenting out the feature extraction part once you have saved the features. That way, if you have to re-run the entire notebook for some reason, you can simply load the saved features instead.
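As a rough sketch of this hint (the dictionary layout is one suggested option, not the only one), the extracted features can be written to disk once and reloaded on later runs:

```python
import torch

# Toy stand-in: in Part 1 this dict maps every image ID to its 2048-d feature tensor.
features = {"000000139": torch.randn(2048)}

# Save once after extraction...
torch.save(features, "image_features.pt")

# ...then on later runs, comment out the extraction cells and simply reload:
features = torch.load("image_features.pt")
```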

Overview

  1. Extracting image features
  2. Text preparation of training and validation data
  3. Training the decoder
  4. Generating predictions on test data
  5. Caption evaluation via BLEU score
  6. Caption evaluation via Cosine similarity
  7. Comparing BLEU and Cosine similarity

1 Extracting image features [11 marks]

1.1 Design an encoder layer with pretrained ResNet50 (4 marks)

1.2 Image feature extraction step (7 marks)

1.1 Design an encoder layer with pretrained ResNet50 (4 marks)

Read through the template EncoderCNN class below and complete the class.

You are expected to use the ResNet50 pretrained on ImageNet provided in the PyTorch torchvision.models library.

1.2 Image feature extraction step (7 marks)

Pass the images through the Encoder model, saving the resulting features for each image. You may like to use a Dataset and DataLoader to load the data in batches for faster processing, or you may choose to simply read in one image at a time from disk without any loaders.

Note that as this is a forward pass only, no gradients are needed. You will need to be able to match each image ID (the image name without file extension) with its features later, so we suggest either saving a dictionary of image ID: image features, or keeping a separate ordered list of image IDs.

Use this ImageNet transform provided.
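A minimal sketch of this forward-only pass, assuming `encoder` is the EncoderCNN from 1.1 and `loader` yields (image batch, image ID batch) pairs (both names are placeholders for your own objects):

```python
import torch

@torch.no_grad()  # forward pass only: no gradients needed
def extract_features(encoder, loader, device):
    encoder.to(device).eval()  # eval mode fixes batch-norm statistics
    features = {}              # image ID -> feature tensor
    for images, image_ids in loader:
        feats = encoder(images.to(device)).cpu()
        for image_id, feat in zip(image_ids, feats):
            features[image_id] = feat
    return features
```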

2 Text preparation [23 marks]

2.1 Build the caption dataset (3 Marks)

2.2 Clean the captions (3 marks)

2.3 Split the data (3 marks)

2.4 Building the vocabulary (10 marks)

2.5 Prepare dataset using dataloader (4 marks)

2.1 Build the caption dataset (3 Marks)

All our selected COCO_5070 images are from the official 2017 train set.

The coco_subset_meta.csv file includes the image filenames and unique IDs of all the images in our subset. The id column corresponds to each unique image ID.

The COCO dataset includes many different types of annotations: bounding boxes, keypoints, reference captions, and more. We are interested in the captioning labels. Open captions_train2017.json from the provided zip file. You are welcome to come up with your own way of doing it, but we recommend using the json package to initially inspect the data, then the pandas package to look at the annotations (if you read the file in as data, you can access the annotations dictionary as data['annotations']).

Use coco_subset_meta.csv to cross-reference with the annotations from captions_train2017.json to get all the reference captions for each image in COCO_5070.
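With toy stand-ins for data['annotations'] and coco_subset_meta.csv (the real ones come from json.load and pd.read_csv), the cross-reference can be a single pandas merge:

```python
import pandas as pd

# Toy stand-ins; in the real notebook these come from the JSON and CSV files.
captions = pd.DataFrame({
    "image_id": [1, 1, 2, 3],
    "caption": ["a dog runs", "a dog on grass", "a red bus", "a man surfing"],
})
meta = pd.DataFrame({"id": [1, 3], "file_name": ["000001.jpg", "000003.jpg"]})

# keep only the captions whose image appears in our subset
subset = captions.merge(meta, left_on="image_id", right_on="id")
```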

For example, you may end up with data looking like this (this is a pandas DataFrame, but it could also be several lists, or some other data structure/s):

(figure: example of images matched to their captions)

2.2 Clean the captions (3 marks)

Create a cleaned version of each caption. If using dataframes, we suggest saving the cleaned captions in a new column; otherwise, if you are storing your data in some other way, create data structures as needed.

A cleaned caption should be all lowercase, and consist of only alphabet characters.

Print out 10 original captions next to their cleaned versions to facilitate marking.
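One possible cleaning function (a sketch; any approach meeting the lowercase, alphabetic-only requirement is fine):

```python
import re

def clean_caption(caption):
    """Lowercase the caption and keep only purely alphabetic words."""
    return " ".join(re.findall(r"[a-z]+", caption.lower()))
```

Note that this drops digits and splits on any non-letter character (so "don't" becomes "don t"); adjust if you prefer another convention.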

(figure: example of images matched to their cleaned captions)

2.3 Split the data (3 marks)

Split the data 70/10/20% into train/validation/test sets. Be sure that each unique image (and all corresponding captions) only appear in a single set.

We provide the function below which, given a list of unique image IDs and a 3-split ratio, shuffles and returns a split of the image IDs.

If using a dataframe, df['image_id'].unique() will return the list of unique image IDs.

2.4 Building the vocabulary (10 marks)

The vocabulary consists of all the possible words which can be used - both as input into the model, and as output predictions, and we will build it using the cleaned words found in the reference captions from the training set. In the vocabulary each unique word is mapped to a unique integer (a Python dictionary object).

A Vocabulary object is provided for you below to use.

Collect all words from the cleaned captions in the training and validation sets, ignoring any words which appear three times or fewer; this should leave you with roughly 2,200 words (plus or minus is fine). As the vocabulary size affects the embedding layer dimensions, it is better not to add very infrequently used words to the vocabulary.

Create an instance of the Vocabulary() object and add all your words to it.
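The counting step might look like this (the Vocabulary object itself is provided below; this sketch only gathers the words to add to it):

```python
from collections import Counter

def frequent_words(cleaned_captions, min_count=4):
    """Return words appearing at least min_count times (i.e. more than three)."""
    counts = Counter(w for caption in cleaned_captions for w in caption.split())
    return sorted(w for w, c in counts.items() if c >= min_count)
```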

2.5 Prepare dataset using dataloader (4 marks)

Create a PyTorch Dataset class and a corresponding DataLoader for the inputs to the decoder. Create three sets: one each for training, validation, and test. Set shuffle=True for the training set DataLoader.

The Dataset function __getitem__(self, index) should return three Tensors:

  1. A Tensor of image features, dimension (1, 2048).
  2. A Tensor of integer word IDs representing the reference caption; use your Vocabulary object to convert each word in the caption to a word ID. Be sure to add the word ID for the <end> token at the end of each caption, then fill in the rest of the caption with the <pad> token so that each caption has a uniform length (max sequence length) of 47.
  3. An integer Tensor giving the true length of each caption (include the <end> token in the count); after batching, this yields the lengths of every caption in the batch.

Note that as each unique image has five or more (say, n) reference captions, each image feature will appear n times, once in each unique (feature, caption) pair.

Load one batch of the training set and print out the shape of each returned Tensor.
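A hedged sketch of such a Dataset (here `features` is the image ID -> feature dict from Part 1, `pairs` a list of (image ID, cleaned caption) tuples, and `word2idx` a hypothetical stand-in for the provided Vocabulary's word-to-integer mapping):

```python
import torch
from torch.utils.data import Dataset

MAX_LEN = 47  # max sequence length from section 2.5

class CaptionDataset(Dataset):
    """One item per (image feature, caption) pair."""
    def __init__(self, features, pairs, word2idx):
        self.features, self.pairs, self.word2idx = features, pairs, word2idx

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, index):
        image_id, caption = self.pairs[index]
        ids = [self.word2idx[w] for w in caption.split() if w in self.word2idx]
        ids.append(self.word2idx["<end>"])
        length = len(ids)                              # true length incl. <end>
        ids += [self.word2idx["<pad>"]] * (MAX_LEN - length)
        return (self.features[image_id].view(1, -1),   # (1, 2048)
                torch.tensor(ids),                     # (47,)
                torch.tensor(length))
```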

3 Train DecoderRNN [20 marks]

3.1 Design an RNN-based decoder (10 marks)

3.2 Train your model with precomputed features (10 Marks)

3.1 Design an RNN-based decoder (10 marks)

Read through the DecoderRNN model below. First, complete the decoder by adding an RNN layer to the decoder where indicated, using the PyTorch API as reference.

Keep all the default parameters except for batch_first, which you may set to True.

In particular, understand the meaning of pack_padded_sequence() as used in forward(). Refer to the PyTorch pack_padded_sequence() documentation.
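A tiny, self-contained illustration of what pack_padded_sequence() does to a padded batch:

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

# Two padded sequences with true lengths 3 and 2 (batch_first=True).
padded = torch.tensor([[1, 2, 3, 0],
                       [4, 5, 0, 0]])
packed = pack_padded_sequence(padded, lengths=[3, 2], batch_first=True)
# packed.data holds only the real (non-pad) entries, interleaved by timestep:
# step 0 -> [1, 4], step 1 -> [2, 5], step 2 -> [3]
```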

3.2 Train your model with precomputed features (10 marks)

Train the decoder by passing the features, reference captions, and targets to the decoder, then computing loss based on the outputs and the targets. Note that when passing the targets and model outputs to the loss function, the targets will also need to be formatted using pack_padded_sequence().

We recommend a batch size of around 64 (though feel free to adjust as necessary for your hardware).

We strongly recommend saving a checkpoint of your trained model after training so you don't need to re-train multiple times.

Display a graph of training and validation loss over epochs to justify your stopping point.
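One possible shape for a single training step (a sketch; `decoder`, `optimizer`, and the decoder's packed output format are assumptions matching the description above, not the required implementation):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

criterion = nn.CrossEntropyLoss()

def train_step(decoder, optimizer, features, captions, lengths):
    # assumed: decoder returns packed scores of shape (sum_of_lengths, vocab_size)
    outputs = decoder(features, captions, lengths)
    # targets must be packed the same way the decoder packed its outputs
    targets = pack_padded_sequence(captions, lengths, batch_first=True).data
    loss = criterion(outputs, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```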

4 Generate predictions on test data [8 marks]

Display 5 sample test images containing different objects, along with your model’s generated captions and all the reference captions for each.

Remember that everything displayed in the submitted notebook and .html file will be marked, so be sure to run all relevant cells.

5 Caption evaluation using BLEU score [10 marks]

There are different methods for measuring the performance of image-to-text models. We will evaluate our model by measuring the text similarity between the generated caption and the reference captions, using two commonly used methods. The first method is known as Bilingual Evaluation Understudy (BLEU).

5.1 Average BLEU score on all data (5 marks)

5.2 Example high and low BLEU score samples (5 marks, at least two)

5.1 Average BLEU score on all data (5 marks)

One common way of comparing a generated text to a reference text is using BLEU. This article gives a good intuition to how the BLEU score is computed: https://machinelearningmastery.com/calculate-bleu-score-for-text-python/, and you may find an implementation online to use. One option is the NLTK implementation nltk.translate.bleu_score here: https://www.nltk.org/api/nltk.translate.bleu_score.html

Tip: BLEU scores can be weighted by n-gram order. Check that your scores make sense, and feel free to use a weighting that best matches the data. We will not be looking for specific score ranges; rather, we will check that the scores are reasonable and meaningful given the captions.

Write the code to evaluate the trained model on the complete test set and calculate the BLEU score of each prediction, compared against all five reference captions.

Display a histogram of the distribution of scores over the test set.
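For instance, using the NLTK implementation linked above (the (0.5, 0.5) weights and the smoothing choice are illustrative, not required):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu(prediction, references):
    """Score one predicted caption against its reference captions."""
    smooth = SmoothingFunction().method1  # avoids zero scores on short captions
    return sentence_bleu([r.split() for r in references],
                         prediction.split(),
                         weights=(0.5, 0.5),  # 1- and 2-grams only
                         smoothing_function=smooth)
```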

5.2 Example high and low BLEU score samples (5 marks)

Find one sample with high BLEU score and one with a low score, and display the model's predicted sentences, the BLEU scores, and the 5 reference captions.

6 Caption evaluation using cosine similarity [12 marks]

6.1 Cosine similarity (6 marks)

6.2 Cosine similarity examples (6 marks)

6.1 Cosine similarity (6 marks)

The cosine similarity measures the cosine of the angle between two vectors in n-dimensional space. The smaller the angle, the greater the similarity.

To use the cosine similarity to measure the similarity between the generated caption and the reference captions:

Calculate the cosine similarity using the model's predictions over the whole test set.

Display a histogram of the distribution of scores over the test set.
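The brief leaves the choice of vector representation open; as one hedged example, a bag-of-words count representation gives:

```python
import math
from collections import Counter

def cosine_similarity(caption_a, caption_b):
    """Cosine similarity between bag-of-words count vectors (one possible
    representation; TF-IDF or learned embeddings are common alternatives)."""
    a, b = Counter(caption_a.split()), Counter(caption_b.split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```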

6.2 Cosine similarity examples (6 marks)

Find one sample with high cosine similarity score and one with a low score, and display the model's predicted sentences, the cosine similarity scores, and the 5 reference captions.

7 Comparing BLEU and Cosine similarity [16 marks]

7.1 Test set distribution of scores (6 marks)

7.2 Analysis of individual examples (10 marks)

7.1 Test set distribution of scores (6 marks)

Compare the model’s performance on the test set evaluated using BLEU and cosine similarity and discuss some weaknesses and strengths of each method (explain in words, in a text box below).

Please note: to compare the average test scores, you need to rescale the cosine similarity scores from [-1, 1] to match the range of the BLEU method, [0.0, 1.0].
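The rescaling itself is a simple linear map:

```python
def rescale(cosine_score):
    """Map a cosine similarity in [-1, 1] linearly onto BLEU's [0, 1] range."""
    return (cosine_score + 1) / 2
```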

BLEU:

Strengths:

Weaknesses:

Cosine Similarity:

Strengths:

Weaknesses:

In conclusion, BLEU emphasizes n-gram overlap, while cosine similarity captures overall semantic similarity between captions.

References:

https://towardsdatascience.com/foundations-of-nlp-explained-bleu-score-and-wer-metrics-1a5ba06d812b

https://www.machinelearningplus.com/nlp/cosine-similarity/#2whatiscosinesimilarityandwhyisitadvantageous

7.2 Analysis of individual examples (10 marks)

Find and display one example where both methods give similar scores and another example where they do not and discuss. Include both scores, predicted captions, and reference captions.

For the sample with similar scores, the predicted caption is similar to the reference captions in terms of both n-gram overlap and overall semantic meaning. Note, however, that both metrics can still assign a high score even when a key noun is missing from the prediction, which is a shared limitation.

For the sample with dissimilar scores, the low BLEU score indicates that the predicted caption has little n-gram overlap with the reference captions. The high cosine similarity score, meanwhile, can be attributed to the fact that vector length has little influence on the result, together with the frequent appearance of common words such as "a" in the reference captions, which inflates the similarity.